Christopher Belanger, PhD
Data Scientist, Ottawa Neighbourhood Study
Managing Partner, Belanger Analytics
September 15, 2021
Scraping data means transforming human-readable data into machine-readable data.
We scrape data when the information we need is published for humans to read, but not for machines. We could also think of it as the creative acquisition of machine-readable data.
Finding the location of every store in a grocery chain, for example, is a perfect use-case for data scraping.
A static website is a collection of HTML (and other) files that you download from a server and view in your browser.
Because the data is all contained in static files, scraping a static website generally follows this recipe:
1. Read the site's HTML into R with rvest::read_html().
2. Extract the data you need with functions like rvest::html_elements() and rvest::html_attrs().

In practice, of course, the steps don't usually go in this nice order :)
To find all Foodland locations, first we use our browser to find the URL for Foodland's store locator page, and then we use SelectorGadget to find the CSS selector for each store's information.
Then we can read the site in R and get the store data:
# read website's html
html <- rvest::read_html("https://foodland.ca/store-locator/")
# separate out the sections for each store
stores <- rvest::html_elements(html, css = ".brand-foodland-store-location")
# isolate the first store for testing
store <- stores[[1]]
store
## {html_node}
## <div class="store-result brand-foodland-store-location" data-js-store-result="" data-id="83832" data-lng="-65.8315" data-lat="46.7291" data-city="blackville" data-province="nb" data-postal-code="e9b 1n3" data-taxonomy-services="" data-taxonomy-types="" data-hours="{&quot;monday&quot;:&quot;8:00 a.m. to 8:00 p.m.&quot;,&quot;tuesday&quot;:&quot;8:00 a.m. to 8:00 p.m.&quot;,&quot;wednesday&quot;:&quot;8:00 a.m. to 8:00 p.m.&quot;,&quot;thursday&quot;:&quot;8:00 a.m. to 8:00 p.m.&quot;,&quot;friday&quot;:&quot;8:00 a.m. to 8:00 p.m.&quot;,&quot;saturday&quot;:&quot;8:00 a.m. to 8:00 p.m.&quot;,&quot;sunday&quot;:&quot;10:00 a.m. to 6:00 p.m.&quot;}" data-brand="foodland-store-location">
## [1] <div class="equal_height">\n\t\t\t\t\t\t\t<div>\n\t\t\t\t\t\t\t\t<h4><a a ...
By inspecting the raw html in our browser with view-source:, we find that some data is stored as invisible attributes. We can extract them with rvest::html_attr():
# extract lat/lon coords using html attributes
lat <- rvest::html_attr(store, "data-lat")
lon <- rvest::html_attr(store, "data-lng")
# print to console
c(lat, lon)
## [1] "46.7291" "-65.8315"
Some data is presented only as human-readable text, so we can extract it using CSS selectors that we find again with SelectorGadget.
We use html_elements() to get the html snippets for each item, then html_text() to get the text.
Here we get the city for the first store:
city <- rvest::html_elements(store, css = ".city")
rvest::html_text(city)
## [1] "Blackville"
To repeat these steps for every store, we can use:

- a for loop; or,
- purrr::map() or lapply().

For a complete worked example, see the first example workbook for this talk.
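For instance, we could wrap the single-store steps in a small helper and map it over every store. This is a sketch, reusing the .city selector and data-* attributes from the Foodland example above:

```r
# build a one-row data frame from a single store's html node
extract_store <- function(store) {
  tibble::tibble(
    city = rvest::html_text(rvest::html_elements(store, css = ".city")),
    lat  = rvest::html_attr(store, "data-lat"),
    lon  = rvest::html_attr(store, "data-lng")
  )
}

# apply the helper to every store and bind the rows into one table
purrr::map_dfr(stores, extract_store)
```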
If a website has a table of values, you’re in luck:
rvest::html_table() automatically puts tabular data into a structured data frame.
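As a quick illustration on a toy table (not a real site), html_table() turns the markup straight into a data frame:

```r
# a minimal html table, parsed from a string
html <- rvest::read_html(
  "<table>
     <tr><th>store</th><th>city</th></tr>
     <tr><td>Foodland</td><td>Blackville</td></tr>
   </table>"
)

# returns a list of data frames, one per table; take the first
rvest::html_table(html)[[1]]
```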
If a website has a form you need to fill and submit, you're not out of luck: rvest::html_form() and its companion functions can help you fill and submit web forms automatically and read the responses.

Dynamic websites are trickier: when we inspect the page with view-source: in our browser, the data isn't there! The page loads it behind the scenes, often through calls to an API, and once we've found that API call we're ready to pull this data directly into R.
# query the extremely long url for the API call
url <- "https://www.circlek.com/stores_new.php?lat=45.421&lng=-75.69&services=&region=global&page=0"
resp <- httr::GET(url)
# extract the content from the response, and parse the JSON result
stores <- httr::content(resp, type = "text/json", encoding = "UTF-8") %>%
jsonlite::fromJSON()
# inspect the structure of the response
str(stores, max.level = 1)
## List of 5
##  $ count      : int 10
##  $ page       : int 1
##  $ division   : chr "ontario"
##  $ stores     :List of 10
##  $ tactic_urls:List of 12
Parsing complex lists can be a pain, but Circle K’s response is easy to tidy:
# convert the response to a nested data frame, and then unnest the data
stores$stores %>%
  enframe() %>%
  unnest_wider(value) %>%
  select(display_brand, address, city, latitude, longitude) %>%
  head(5)
## # A tibble: 5 x 5
##   display_brand address                   city   latitude   longitude
##   <chr>         <chr>                     <chr>  <chr>      <chr>
## 1 Mac's         "11-160 Elgin St., "      OTTAWA 45.4197298 -75.6928678
## 2 Circle K      "388 Elgin Street"        OTTAWA 45.4144881 -75.6876387
## 3 Circle K      "120 Osgoode St."         OTTAWA 45.4237455 -75.6809046
## 4 Circle K      "210 Laurier Avenue East" OTTAWA 45.4256518 -75.6817806
## 5 Circle K      "333 Rideau Street"       OTTAWA 45.4296923 -75.6843574
For this API, request parameters are sent in the url after ? and separated by &.
https://www.circlek.com/stores_new.php?lat=45.421&lng=-75.69&services=&region=global&page=0
So the parameters here are:
- lat=45.421, lng=-75.69: The geographic coordinates for the search.
- services=: Blank in this request; maybe to look for specific services?
- region=global: Might let you limit searches; not of interest to us.
- page=0: Aha! This tells the API which page of results to return!

To collect every result, we can step through the pages with a for loop:

# set up an empty tibble for our results
results <- tibble::tibble()
# assume num_pages holds the index of the last page of results
for (page in 0:num_pages){
# assume the base url is in a variable called base_url
url <- paste0(base_url, page)
# call a function to call the API and parse the results
result <- call_circlek_api(url)
# add the result to our big results table
results <- dplyr::bind_rows(results, result)
}
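The helper call_circlek_api() isn't defined on this slide; based on the parsing steps above, it might look roughly like this (an untested sketch):

```r
call_circlek_api <- function(url) {
  # call the API and parse the JSON response
  resp <- httr::GET(url)
  parsed <- httr::content(resp, type = "text/json", encoding = "UTF-8") %>%
    jsonlite::fromJSON()

  # tidy the nested stores list into a data frame, as before
  parsed$stores %>%
    tibble::enframe() %>%
    tidyr::unnest_wider(value) %>%
    dplyr::select(display_brand, address, city, latitude, longitude)
}
```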
After some (off-screen) data collection, we can plot a heatmap of 9,600 global Circle K locations.
In closing, a few suggestions for web-scraping etiquette:
- Be gentle with servers: space out your requests, for example by pausing between calls with Sys.sleep().
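For example, we could add a pause to the paging loop from earlier; the one-second delay is an arbitrary choice:

```r
for (page in 0:num_pages){
  url <- paste0(base_url, page)
  result <- call_circlek_api(url)
  results <- dplyr::bind_rows(results, result)

  # be polite: wait a second before hitting the server again
  Sys.sleep(1)
}
```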
The httr package provides functions for making web requests with httr::GET() and httr::POST(), and for interpreting their responses with httr::content().